feat: per-pick confidence scores + abstention (Phase 2.4)#21

Merged
hallelx2 merged 3 commits into main from feat/confidence-and-abstention
May 27, 2026


@hallelx2 hallelx2 commented May 27, 2026

Summary

  • Selection JSON schema now accepts a picks: [{id, confidence}] shape carrying per-pick confidence in [0.0, 1.0]. The legacy selected_section_ids shape still parses so older / weaker models keep working.
  • When every confidence falls strictly below retrieval.abstain.below (default 0.4), /v1/query returns an abstention response (sections: [], abstained: true) and /v1/answer skips synthesis entirely and answers with a canonical refusal.
  • Successful responses surface a confidences map keyed by section_id. Abstention responses additionally carry abstention_reason, min_confidence_threshold, and candidate_confidences.
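The two wire shapes described above can be sketched with Go struct tags. This is a minimal illustration, not the PR's actual types: `queryResponse`, `abstainedResponse`, and `encode` are hypothetical names, sections are reduced to bare IDs, and omitting `abstained` on success is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// queryResponse is a hypothetical, trimmed-down wire shape. Sections are
// reduced to bare IDs here; the real response carries full section objects.
type queryResponse struct {
	DocumentID             string             `json:"document_id"`
	Sections               []string           `json:"sections"`
	Confidences            map[string]float64 `json:"confidences,omitempty"`
	Abstained              bool               `json:"abstained,omitempty"`
	AbstentionReason       string             `json:"abstention_reason,omitempty"`
	MinConfidenceThreshold *float64           `json:"min_confidence_threshold,omitempty"`
	CandidateConfidences   map[string]float64 `json:"candidate_confidences,omitempty"`
}

// abstainedResponse builds the abstention shape: a non-nil empty sections
// slice (so it encodes as [] rather than null) plus the diagnostic fields.
func abstainedResponse(docID string, threshold float64, candidates map[string]float64) queryResponse {
	return queryResponse{
		DocumentID:             docID,
		Sections:               []string{},
		Abstained:              true,
		AbstentionReason:       "no candidate section scored above the confidence threshold",
		MinConfidenceThreshold: &threshold,
		CandidateConfidences:   candidates,
	}
}

// encode marshals a response; marshalling cannot fail for these field types.
func encode(r queryResponse) string {
	b, err := json.Marshal(r)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	success := queryResponse{
		DocumentID:  "doc_1",
		Sections:    []string{"sec_a", "sec_b"},
		Confidences: map[string]float64{"sec_a": 0.82, "sec_b": 0.31},
	}
	fmt.Println(encode(success))
	fmt.Println(encode(abstainedResponse("doc_1", 0.4, map[string]float64{"sec_a": 0.12, "sec_b": 0.20})))
}
```

The threshold is a pointer in this sketch so that a configured threshold of 0.0 would still be emitted instead of being dropped by `omitempty`.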

Design rationale

  • Additive on the strategy boundary. Result.SelectedIDs stays []tree.SectionID; Confidences is a separate map[SectionID]float64 field, omitted from JSON when empty. Callers that don't care about confidence see no API change.
  • Strategies never abstain. Each strategy populates Result.Confidences if the model returned them; the abstention decision lives entirely in the API layer (internal/api/server.go). This keeps the strategies pure ("return what the model picked") and confines policy to one place.
  • Abstention requires explicit signal. A nil confidence map (legacy LLM response, or new shape with no confidence keys populated) is the "no signal" sentinel. The abstention check returns false for nil / empty maps so older models cannot accidentally trip a refusal.
  • "All picks below" semantics. If even one pick scored at-or-above the threshold, the engine has enough signal to surface evidence — abstention is reserved for the case where every candidate is weak. This matches the plan and avoids over-refusal.
  • Trace-token absence on abstention. Replay isn't meaningful for an abstention (there's no retrieval result to reproduce), so abstention responses omit trace_token and aren't written to the replay store.
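The abstention rules in this list (strictly-below comparison, "all picks below", and the nil / empty "no signal" sentinel) can be sketched as one predicate. The argument order follows the sequence diagram later in this PR, but the exact signature and the use of plain string keys are assumptions; the real check lives in internal/api/server.go.

```go
package main

import "fmt"

// shouldAbstain reports whether every confidence is strictly below the
// threshold. A nil or empty map means "no signal" and never abstains.
// Hypothetical signature: string keys stand in for tree.SectionID.
func shouldAbstain(confidences map[string]float64, below float64, enabled bool) bool {
	if !enabled || len(confidences) == 0 {
		return false
	}
	for _, c := range confidences {
		if c >= below {
			return false // one pick at-or-above the threshold is enough signal
		}
	}
	return true
}

func main() {
	allLow := map[string]float64{"sec_a": 0.12, "sec_b": 0.20}
	oneHigh := map[string]float64{"sec_a": 0.82, "sec_b": 0.31}
	boundary := map[string]float64{"sec_a": 0.4}

	fmt.Println(shouldAbstain(allLow, 0.4, true))   // every pick is weak
	fmt.Println(shouldAbstain(oneHigh, 0.4, true))  // sec_a carries signal
	fmt.Println(shouldAbstain(boundary, 0.4, true)) // exactly at threshold
	fmt.Println(shouldAbstain(nil, 0.4, true))      // legacy response
}
```

Only the first call returns true. The boundary case returns false because the comparison is strict, and the nil map returns false because a legacy response carries no confidence signal at all.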

Opt-out / configuration

| Knob | Where | Default |
| --- | --- | --- |
| Per-call | enable_abstain: false on the /v1/query or /v1/answer request body | absent (use config) |
| Server | retrieval.abstain.enabled: false in config.yaml | true (opt-out) |
| Env | VLE_RETRIEVAL_ABSTAIN_ENABLED=false | unset |
| Threshold | retrieval.abstain.below: 0.5 / VLE_RETRIEVAL_ABSTAIN_BELOW=0.5 | 0.4 |
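One way to model the per-call knob is a pointer field, so an absent enable_abstain (nil) can be told apart from an explicit false. The sketch below assumes that approach; `queryRequest` is a hypothetical subset of the request body, and the signature of `abstentionEnabled` is a guess, even though the PR's tests mention a helper by that name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// queryRequest is a hypothetical subset of the /v1/query request body.
// A *bool distinguishes "field absent" (nil) from an explicit true/false.
type queryRequest struct {
	Query         string `json:"query"`
	EnableAbstain *bool  `json:"enable_abstain,omitempty"`
}

// abstentionEnabled applies the precedence from the table above: a
// per-request override wins, otherwise the server configuration decides.
func abstentionEnabled(override *bool, serverEnabled bool) bool {
	if override != nil {
		return *override
	}
	return serverEnabled
}

func main() {
	var optOut, plain queryRequest
	if err := json.Unmarshal([]byte(`{"query":"q","enable_abstain":false}`), &optOut); err != nil {
		panic(err)
	}
	if err := json.Unmarshal([]byte(`{"query":"q"}`), &plain); err != nil {
		panic(err)
	}
	fmt.Println(abstentionEnabled(optOut.EnableAbstain, true)) // request opts out
	fmt.Println(abstentionEnabled(plain.EnableAbstain, true))  // falls back to config
}
```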

Test plan

  • go build ./... clean
  • go vet ./... clean
  • go test ./... all green (all pre-existing tests pass + new coverage)
  • New parse-side tests: new-shape, legacy-shape, mixed-shape, clamped, deduped, new-shape-no-confidences
  • New strategy tests: SinglePass / ChunkedTree / Agentic surface Confidences, strategies never abstain
  • New config tests: defaults, env overrides (enable / disable / parse / edge), bad-input rejection, validation
  • New API tests: shouldAbstain predicate, helper sentinels, respondAbstained shape (query + answer), synthesis LLM tripwire (must not be called on abstention path), trace_token absent on abstention

Before / after examples

LLM new-shape response → confidences populated

// LLM returns:
{"picks":[{"id":"sec_a","confidence":0.82},{"id":"sec_b","confidence":0.31}],"reasoning":"x"}

// /v1/query response (no abstention because 0.82 ≥ 0.4):
{
  "document_id": "...",
  "sections": [{"id": "sec_a", ...}, {"id": "sec_b", ...}],
  "confidences": {"sec_a": 0.82, "sec_b": 0.31},
  ...
}

LLM all-low response → abstained

// LLM returns:
{"picks":[{"id":"sec_a","confidence":0.12},{"id":"sec_b","confidence":0.20}]}

// /v1/query response:
{
  "document_id": "...",
  "query": "...",
  "strategy": "chunked-tree",
  "sections": [],
  "abstained": true,
  "abstention_reason": "no candidate section scored above the confidence threshold",
  "min_confidence_threshold": 0.4,
  "candidate_confidences": {"sec_a": 0.12, "sec_b": 0.20}
}

// /v1/answer response (synthesis skipped):
{
  "document_id": "...",
  "answer": "I cannot answer this question from the supplied document.",
  "citations": [],
  "abstained": true,
  ...
}

Mixed-shape response handled

// LLM returns (some picks with confidence, some without):
{"picks":[{"id":"sec_a","confidence":0.9},{"id":"sec_b"},{"id":"sec_c","confidence":0.4}]}

// confidences map surfaces only present scores:
{"confidences": {"sec_a": 0.9, "sec_c": 0.4}, ...}
// sec_b is in sections[] but absent from confidences (no signal for that pick)

Legacy response → no abstention

// LLM returns:
{"selected_section_ids":["sec_a","sec_b"]}

// /v1/query response: NO confidences map, NO abstention check fires.
// Older models continue to work unchanged.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added confidence-driven abstention: when all candidate section confidences fall below a configured threshold, the API returns an abstention response with empty results instead of uncertain answers.
    • New enable_abstain parameter on query and answer endpoints for per-request abstention override control.
    • Responses now include per-section confidence scores for transparency.
  • Configuration

    • New retrieval.abstain configuration block with enabled toggle and below confidence threshold (default: 0.4, range: 0.0–1.0).

hallelx2 added 3 commits May 27, 2026 03:08
Extend the selection JSON schema to accept either the legacy
{selected_section_ids: [...]} shape or the new
{picks: [{id, confidence}]} shape with per-pick confidence in
[0.0, 1.0]. ParseSelection returns (ids, confidences, err); legacy
responses surface confidences=nil so callers can distinguish "no
confidence signal" from "all confidences low".
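A rough sketch of such a dual-shape parser follows. It is an illustration under assumptions, not the PR's ParseSelection: IDs are plain strings instead of tree.SectionID, the first occurrence of a duplicate pick wins (the commit says picks are deduped but not which copy survives), and the struct names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pick mirrors one entry of the new shape; Confidence is nil when omitted.
type pick struct {
	ID         string   `json:"id"`
	Confidence *float64 `json:"confidence"`
}

// selection accepts both the new and the legacy shape in one decode.
type selection struct {
	Picks     []pick   `json:"picks"`
	LegacyIDs []string `json:"selected_section_ids"`
}

// clamp01 forces a confidence into [0.0, 1.0].
func clamp01(v float64) float64 {
	if v < 0 {
		return 0
	}
	if v > 1 {
		return 1
	}
	return v
}

// parseSelection returns the picked IDs and a confidence map. The map stays
// nil when the model supplied no confidence at all ("no signal").
func parseSelection(raw []byte) ([]string, map[string]float64, error) {
	var sel selection
	if err := json.Unmarshal(raw, &sel); err != nil {
		return nil, nil, err
	}
	if len(sel.Picks) == 0 {
		return sel.LegacyIDs, nil, nil // legacy shape
	}
	seen := make(map[string]bool)
	var ids []string
	var conf map[string]float64
	for _, p := range sel.Picks {
		if p.ID == "" || seen[p.ID] {
			continue // assumption: first occurrence of a duplicate wins
		}
		seen[p.ID] = true
		ids = append(ids, p.ID)
		if p.Confidence != nil {
			if conf == nil {
				conf = make(map[string]float64)
			}
			conf[p.ID] = clamp01(*p.Confidence)
		}
	}
	return ids, conf, nil
}

func main() {
	ids, conf, err := parseSelection([]byte(`{"picks":[{"id":"sec_a","confidence":0.9},{"id":"sec_b"}]}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(ids, conf)
}
```

Allocating the map lazily means a new-shape response with no confidence keys populated also yields nil, the same sentinel as a legacy response.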

Each strategy plumbs the confidence map through:
- SinglePass fills Result.Confidences from the parsed map, filtered
  against the post-FilterKnownIDs survivors.
- ChunkedTree unions per-slice confidence maps (max-wins on duplicate
  IDs across overlapping slices) and filters to the merged ID set.
- Agentic accepts both done-shape variants. The new picks shape
  surfaces per-pick confidences on the final Result.
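The max-wins union used when merging overlapping slices can be sketched as follows. `mergeMaxWins` is a hypothetical helper name and string keys stand in for tree.SectionID; the real merge also filters to the merged ID set, which this sketch leaves out.

```go
package main

import "fmt"

// mergeMaxWins unions several confidence maps. When the same section ID
// appears in more than one map, the highest confidence is kept. The result
// is nil when no input carried any confidence, preserving "no signal".
func mergeMaxWins(maps ...map[string]float64) map[string]float64 {
	var out map[string]float64
	for _, m := range maps {
		for id, c := range m {
			if out == nil {
				out = make(map[string]float64)
			}
			if prev, ok := out[id]; !ok || c > prev {
				out[id] = c
			}
		}
	}
	return out
}

func main() {
	sliceA := map[string]float64{"sec_a": 0.3, "sec_b": 0.7}
	sliceB := map[string]float64{"sec_b": 0.5, "sec_c": 0.9}
	merged := mergeMaxWins(sliceA, sliceB)
	fmt.Println(merged["sec_a"], merged["sec_b"], merged["sec_c"])
}
```

Taking the maximum rather than the mean keeps a section that one slice scored highly from being dragged below the abstention threshold by a weaker score from an overlapping slice.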

Result.SelectedIDs stays []tree.SectionID — the change is purely
additive. Callers that don't care about confidence see no API change.
The strategy never abstains; the API layer's abstention check (next
commit) is the only place "all confidences below threshold" becomes
an abstention response.

Tests cover: new-shape parse, legacy-shape parse, mixed-shape parse
(some picks with confidence, some without), confidence clamping,
duplicate-pick dedup, per-strategy fill, chunked-tree merge, and the
agentic done-with-picks path.
…verrides

AbstainBlock carries Enabled + Below (the [0.0, 1.0] confidence
threshold below which picks count as "not confident"). When the
selection LLM returns explicit per-pick confidence and EVERY pick
falls below Below, the API layer surfaces an abstention response
instead of pretending the document held an answer.

Defaults: Enabled=true (opt-out), Below=0.4. Env overrides:
VLE_RETRIEVAL_ABSTAIN_ENABLED (truthy/falsy), VLE_RETRIEVAL_ABSTAIN_BELOW
(float in [0,1]). Validation rejects out-of-range Below values; bad
env strings preserve the default rather than zeroing the field.
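The override behaviour described here might look like the sketch below. Several details are assumptions: the struct and function names are hypothetical, `strconv.ParseBool` stands in for whatever truthy/falsy set the real loader accepts, and an out-of-range env value is left for validation to reject rather than being ignored at parse time.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// abstainBlock is a hypothetical stand-in for the PR's AbstainBlock.
type abstainBlock struct {
	Enabled bool
	Below   float64
}

// defaultAbstain returns the documented defaults: enabled, threshold 0.4.
func defaultAbstain() abstainBlock {
	return abstainBlock{Enabled: true, Below: 0.4}
}

// applyEnv overlays environment overrides. Unparsable strings leave the
// existing value untouched instead of zeroing the field.
func applyEnv(cfg *abstainBlock, getenv func(string) string) {
	if v := getenv("VLE_RETRIEVAL_ABSTAIN_ENABLED"); v != "" {
		if b, err := strconv.ParseBool(v); err == nil {
			cfg.Enabled = b
		}
	}
	if v := getenv("VLE_RETRIEVAL_ABSTAIN_BELOW"); v != "" {
		if f, err := strconv.ParseFloat(v, 64); err == nil {
			cfg.Below = f
		}
	}
}

// validate rejects thresholds outside [0.0, 1.0]. The negated form also
// rejects NaN, which fails every comparison.
func (c abstainBlock) validate() error {
	if !(c.Below >= 0 && c.Below <= 1) {
		return fmt.Errorf("retrieval.abstain.below must be in [0.0, 1.0], got %v", c.Below)
	}
	return nil
}

func main() {
	cfg := defaultAbstain()
	applyEnv(&cfg, os.Getenv)
	if err := cfg.validate(); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Printf("enabled=%v below=%v\n", cfg.Enabled, cfg.Below)
}
```

Passing the lookup function in, rather than calling `os.Getenv` directly, lets tests supply a map-backed environment without mutating process state.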

Tests cover defaults, env overrides (enable/disable/parse), edge
cases (0.0, 1.0 inclusive), bad-input rejection, and validation.

When the selection LLM returns per-pick confidences and every pick
falls strictly below retrieval.abstain.below (default 0.4), the API
layer skips the normal path and returns an abstention response:

  /v1/query  → sections: [], abstained: true,
               abstention_reason, min_confidence_threshold,
               candidate_confidences
  /v1/answer → answer: "I cannot answer this question from the
               supplied document.", citations: [],
               same abstention fields, synthesis LLM call skipped
               entirely (planning + retrieval usage carried through)

The "all picks below" semantics is deliberate: if even one section
scored at-or-above the threshold the engine surfaces it as evidence.
Abstention is reserved for the case where every candidate is weak.

Abstention requires explicit confidence signal — legacy-shape LLM
responses (no confidence map) always fall through to the normal
path. Per-request `enable_abstain` body field overrides the server
config; opt out globally via retrieval.abstain.enabled: false.

Other changes:
- Result.Confidences threads through the Decomposer (multi-hop
  plans union confidences max-wins on overlap).
- Successful (non-abstained) responses surface a `confidences` map
  on the wire when the model returned them.
- Abstention responses carry no trace_token — there is no retrieval
  result to replay.
- cmd/engine wires cfg.Retrieval.Abstain into the Deps.

Tests cover: shouldAbstain predicate (all-below, one-above,
boundary, nil/empty); filterConfidencesToIDs sentinel preservation;
stringKeyedConfidences conversion; abstentionEnabled body-override
precedence; respondAbstained / respondAbstainedAnswer shape;
synthesis tripwire (LLM must not be called on abstention path);
trace_token absence on abstention.
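As an illustration of what "sentinel preservation" could mean for the filter helper, here is a minimal sketch. The name filterConfidencesToIDs comes from the test list above, but the signature, the string keys, and the choice to collapse an empty result back to nil are all assumptions.

```go
package main

import "fmt"

// filterConfidencesToIDs keeps only the confidences whose section ID is in
// ids. A nil input stays nil so "no signal" is never turned into an empty
// but present map; a filter that removes every entry also yields nil.
func filterConfidencesToIDs(conf map[string]float64, ids []string) map[string]float64 {
	if conf == nil {
		return nil
	}
	var out map[string]float64
	for _, id := range ids {
		if c, ok := conf[id]; ok {
			if out == nil {
				out = make(map[string]float64)
			}
			out[id] = c
		}
	}
	return out
}

func main() {
	conf := map[string]float64{"sec_a": 0.9, "sec_b": 0.2, "sec_c": 0.6}
	kept := filterConfidencesToIDs(conf, []string{"sec_a", "sec_c", "sec_x"})
	fmt.Println(len(kept), kept["sec_a"], kept["sec_c"])
	fmt.Println(filterConfidencesToIDs(nil, []string{"sec_a"}) == nil)
}
```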

OpenAPI:
- enable_abstain on QueryRequest + AnswerRequest.
- abstained, abstention_reason, min_confidence_threshold,
  candidate_confidences, confidences on both response schemas.

coderabbitai Bot commented May 27, 2026

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

This PR implements confidence-driven abstention across the retrieval engine. Selection LLMs now return per-section confidence scores alongside selected IDs, and these scores flow through all retrieval strategies and the decomposer. The API checks whether every confidence falls below a configurable threshold and, if so, returns an abstention response (empty sections for /v1/query, a canonical refusal for /v1/answer) instead of weak grounding, with per-request override support.

Changes

Confidence-Driven Abstention

| Layer / File(s) | Summary |
| --- | --- |
| Abstention Configuration: pkg/config/config.go, pkg/config/config_test.go, config.example.yaml | AbstainBlock type with enabled toggle and below threshold (0.0–1.0); defaults to enabled at 0.4; environment overrides and validation enforce bounds. |
| Retrieval Result & Selection Parser: pkg/retrieval/strategy.go, pkg/retrieval/single_pass.go, pkg/retrieval/retrieval_test.go | Result.Confidences map added (omitted when absent). Selection parser updated to return (ids, confidences, error) and support new picks JSON shape with per-ID confidence (clamped, deduped) alongside legacy selected_section_ids fallback. |
| Single-Pass Strategy Confidence Flow: pkg/retrieval/single_pass.go | Selection prompt and schema extended to prefer picks format with confidence constraints. runSelectionWithRetry returns confidences; parsing logic normalizes dual-format responses and filters confidences to final deduplicated IDs. |
| Chunked-Tree Multi-Slice Confidence Merge: pkg/retrieval/chunked_tree.go | Per-slice result struct includes confidence map; slice goroutines capture confidences; merge stage unions confidences across slices using max-wins rule per section ID. |
| Agentic Strategy Confidence Picks: pkg/retrieval/agentic.go, pkg/retrieval/agentic_test.go | done action now accepts picks array with per-ID confidence (preferred format, clamped to [0.0, 1.0]) or falls back to legacy picked_ids. System prompt and action protocol instruct model on confidence scoring. Result includes filtered confidences. |
| Decomposer Multi-Hop Confidence Union: pkg/retrieval/decompose.go | DecomposedSelect delegates to new DecomposedSelectWithConfidences; multi-hop execution unions per-sub-question confidences (max-wins per section) and returns (ids, confidences, usage) while preserving first-seen ID order. |
| API Server Abstention Decision & Shaping: internal/api/server.go | Deps struct adds Abstain config; query/answer request bodies accept enable_abstain override. Selection refactored to return (ids, confidences, usage). When enabled and all confidences below threshold, routes to abstention response (empty sections/citations, abstained=true, refusal text for answer, omitted trace_token) instead of proceeding to re-rank/synthesis. Confidence maps filtered to final IDs and included in responses when present. |
| API Abstention Tests & OpenAPI Spec: internal/api/abstention_test.go, openapi.yaml | Tests verify threshold logic, confidence filtering, request override semantics, response shapes (abstained flag, reason, threshold value, empty sections, candidate/final confidences), and that abstention skips LLM synthesis. OpenAPI documents enable_abstain parameter, confidence and abstention fields, refusal semantics, and trace_token behavior. |
| Engine Integration: cmd/engine/main.go | Configured Retrieval.Abstain wired into api.Deps for request handling. |

Sequence Diagram(s)

sequenceDiagram
  participant HTTPRequest
  participant handleQuery
  participant runSelection
  participant shouldAbstain
  participant respondAbstained
  HTTPRequest->>handleQuery: enable_abstain override + query
  handleQuery->>runSelection: retrieve sections + confidences
  runSelection-->>handleQuery: (selectedIDs, confidences, usage)
  handleQuery->>shouldAbstain: (confidences, threshold, enabled)
  shouldAbstain-->>handleQuery: all below threshold?
  alt abstain
    handleQuery->>respondAbstained: shape abstention response
    respondAbstained-->>HTTPRequest: 200 OK, abstained=true, empty sections
  else continue
    handleQuery->>handleQuery: proceed to re-rank/synthesis
    handleQuery-->>HTTPRequest: 200 OK, sections/answer with confidences
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat: per-pick confidence scores + abstention (Phase 2.4)' accurately summarizes the two main features added: per-pick confidence scores from selection LLM and an API-layer abstention mechanism based on confidence thresholds. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 81.25% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@hallelx2 hallelx2 merged commit eac87c6 into main May 27, 2026
5 of 9 checks passed
@hallelx2 hallelx2 deleted the feat/confidence-and-abstention branch May 27, 2026 02:24